Summarisation
Abstract - Many visual surveillance tasks, e.g. video summarisation, are conventionally accomplished through analysing imagery-based
features. Relying solely on visual cues for public surveillance video understanding is unreliable, since visual observations
obtained from public space CCTV video data are often not sufficiently trustworthy and events of interest can be subtle. We believe 
that non-visual data sources such as weather reports and traffic sensory signals can be exploited to complement visual data for 
video content analysis and summarisation. In this paper, we present a novel unsupervised framework to learn jointly from both 
visual and independently-drawn non-visual data sources for discovering meaningful latent structure of surveillance video data. 
In particular, we investigate ways to cope with discrepant dimension and representation whilst associating these heterogeneous 
data sources, and derive an effective mechanism to tolerate missing and incomplete data from different sources. We show that
the proposed multi-source learning framework not only achieves better video content clustering than state-of-the-art methods, but 
also is capable of accurately inferring missing non-visual semantics from previously-unseen videos. In addition, a comprehensive 
user study is conducted to validate the quality of video summarisation generated using the proposed multi-source model. 

Index Terms - Multi-source data, heterogeneous data, visual surveillance, clustering, event recognition, video summarisation.

Visual features and descriptors are often carefully designed
and exploited as the sole input for surveillance
video content analysis and summarisation. For instance, 
optical or particle flow is typically employed in activity 
modelling [1], [2], [3], foreground pixel feature is used 
for multi-camera video understanding [4], space-time image 
gradient is adopted for crowd analysis [5], and mixture of 
dynamic textures is used for video segmentation [6] and 
anomaly detection [7]. 

A critical task in visual surveillance is to automatically
make sense of massive amounts of video data by summarising
its content using higher-level intrinsic physical events 1
beyond low-level key-frame visual feature statistics and/or 
object detection counts. In most contemporary techniques, 
low-level imagery visual cues are typically exploited as the 
only information source for video summarisation [8], [9], 
[10], [11], [12]. On the other hand, in complex and cluttered
public scenes there are intrinsically more interesting and
salient higher-level events that can provide a more meaningful
and concise summarisation of the video data. However,
such events may be neither visually well-defined (easily
detectable) nor reliably detected by visual cues alone. In
particular, surveillance visual data from public spaces is
often inaccurate and/or incomplete due to uncontrollable
sources of variation, changes in illumination, occlusion, and
background clutter [13].

In this study, we wish to exploit non-visual auxiliary
information to complement the unilateral perspective from
visual observations. Examples of non-visual sources include
weather reports, GPS-based traffic data, geo-location data,
textual data from social networks, and on-line event schedules.
The auxiliary data sources are beneficial to visual data
modelling because, although visual and non-visual data
may have very different characteristics and natures, they
depict the same physical phenomenon in a scene. They are
intrinsically correlated, although the correlation may be
mostly indirect and reside in some latent space. Effectively
discovering and exploiting such a latent correlation space can
facilitate the underlying data structure discovery and bridge
the semantic gap between low-level visual features and
high-level semantic interpretation.

Challenges - Nevertheless, it is non-trivial to formulate a
framework that exploits both visual and non-visual data for
video content analysis and summarisation, both algorithmically
and in practice.

Algorithmically, unsupervised mining of latent correlation
and interaction between heterogeneous data sources
faces a number of challenges: (1) Disparate sources differ
significantly in representation (continuous or categorical),
and vary greatly in scale and covariance 2 . In addition, the
dimension of visual sources often far exceeds that of non-visual
information (>2000 visual dimensions vs.
<10 non-visual dimensions). Owing to this dimensionality
discrepancy problem, a straightforward concatenation of
features will result in a representation biased
towards the imagery data. (2) Both visual and non-visual
data in isolation can be inaccurate and incomplete.

2. Also known as the heteroscedasticity problem [14].

Fig. 1. Overview of the proposed multi-source driven video summarisation framework. We consider a novel setting where multiple
heterogeneous sources are present during the model training stage. The proposed Multi-Source Clustering Forest discovers and exploits
latent correlations among heterogeneous visual and non-visual data sources, both of which can be inaccurate and untrustworthy. In
deployment, our model uncovers visual content structures and infers semantic tags on previously-unseen video data for video summarisation.


In practice, auxiliary data sources, e.g. weather, traffic
reports, and event timetables, may be rather unreliable in
availability. Specifically, the reports may not be released
on-the-fly at time stamps synchronised with the surveillance
video stream. In addition, existing video control rooms may
not necessarily have direct access to these sources. This
renders impractical any model that expects complete visual
and non-visual information during deployment.

Our solution - In this study, we address this multi-source
learning problem in the context of video summarisation,
which is conventionally based on visual feature analysis and
object detection/segmentation. In particular, we formulate a novel
framework that is capable of performing joint learning
given multiple heterogeneous sources (Fig. 1). We consider
visual data as the main source and non-visual data as
the auxiliary sources, since we believe visual information
still plays the main role in video content analysis. During
training, we assume access to both visual and non-visual
data. The model performs multi-source data clustering and
discovers a set of visual clusters tagged with non-visual
data distributions, e.g. different weather conditions and traffic
speeds. We term this model the multi-source model. During
the deployment stage, we only assume the availability of
previously-unseen video data since non-visual data may
not be accessible due to the aforementioned limitations.
Since the learned model has already captured the latent
structure of heterogeneous types of data sources, the model
can be used for semantic video clustering and non-visual
tag inference on previously-unseen video sequences, even
without the non-visual data. Subsequently, key clips are
automatically selected from the discovered clusters. The
final summary video can be produced by chronologically
compositing these key clips enriched by the inferred tags.


Contributions - The main contributions of this work are: 

1) We propose a unified multi-source learning framework
capable of discovering semantic structures of
video content collectively from heterogeneous visual
and non-visual data. This is made possible by formulating
a novel Multi-Source Clustering Forest (MSC-Forest)
that seamlessly handles multiple heterogeneous
data sources dissimilar in representation, distribution,
and dimension. Although both visual and non-visual
data in isolation can be inaccurate and incomplete,
our model is capable of uncovering and subsequently
exploiting the shared latent correlation for better data
structure discovery.

2) The model is novel in its ability to accommodate
partially or completely missing non-visual sources.
In particular, we introduce a joint information gain
function that is capable of dynamically adapting to
an arbitrary amount of missing non-visual information
during model learning. In model deployment, only
visual input is required for inferring missing non-visual
semantics.

Extensive comparative evaluations are conducted on two
public surveillance videos captured from both indoor and
outdoor environments. Comparative results show that the
proposed model not only outperforms the state-of-the-art
methods [15], [16] for video content clustering and structure
discovery, but is also superior in predicting non-visual
tags for previously-unseen videos. The robustness of
the proposed model is further validated by a user study on
video summary quality.

1 Related Work 

Multi-modality learning - There exist studies that exploit
different sensory or information modalities from a single
source for data structure mining. For example, Cai et
al. [17] propose to perform multi-modal image clustering by
learning a commonly shared graph-Laplacian matrix from
different visual feature modalities. Heer and Chi [18] linearly
combine individual similarity matrices derived from
multi-modal webpages for web user grouping. Karydis et
al. [19] present a tensor based model to cluster music
items with additional tags. In terms of video analysis,
the auditory channel and/or transcripts have been widely
explored for detecting semantic concepts from multimedia
videos [20], [21], summarising highlights in news and
broadcast programs [22], [23], or locating speakers [24].
User tags associated with web videos (e.g. YouTube) have
also been utilised [25], [26], [27]. In contrast, surveillance
videos captured from public spaces typically come with neither
auditory signals nor synchronised transcripts or user
tags. Instead, we wish to explore alternative non-visual
data drawn independently from multiple
sources, with the inherent challenges of being inaccurate and
incomplete, unsynchronised with, and possibly in conflict
with, the observed visual data.

Multi-source learning - An alternative multi-source learning
mechanism is the clustering ensemble [28], [29], where
a collection of clustering instances is generated and then
aggregated into the final clustering solution. Typically only
a single data source is considered, but the approach can be
easily extended to handle multi-source data, e.g. by creating a
respective clustering instance for each source. Nonetheless,
cross-source correlation is ignored since the clustering
instances are formed separately and no interaction between
them is involved. A closer approach to ours is the Affinity
Aggregation Spectral Clustering (AASC) [15], which
learns data structure from multiple types of homogeneous
information (visual features only). This method independently
generates multiple affinity matrices by exhaustive
pairwise distance computation for every pair of samples in
every data source. It suffers from an unwieldy representation
given high-dimensional data inputs. Importantly, although
it seeks an optimal weighted combination of distinct
affinity matrices, it does not consider correlation between
different sources in model learning, similar to the clustering
ensemble [28], [29]. Differing from the above models, our
Multi-Source Clustering Forest overcomes these problems
by generating a unified single affinity matrix that captures
latent correlations among heterogeneous types of data
sources. Furthermore, our model has a unique advantage
over [28], [29], [15] in handling missing non-visual data.

Video summarisation - Contemporary video summarisation
methods can be broadly classified into two paradigms:
key-frame-based [11], [30], [31], [32], [33] and
object-based [9], [10], [34] methods. The key-frame-based
approaches select representative key-frames by analysing
low-level imagery properties, e.g. optical flow [30], image
differences [31], or object appearance and motion [11], to form
a storyboard of still images. Object-based techniques [9],
[10], on the other hand, rely on object segmentation and
tracking to extract object-centric trajectories/tubes, and
compress those tubes to reduce spatiotemporal redundancy.

Both the above schemes utilise solely visual information
and make implicit assumptions about the completeness and
accuracy of the visual data available in extracting features
or object-centred representations. They are neither suitable for
nor scalable to complex scenes where visual data are inherently
incomplete and inaccurate, which is mostly the case in
surveillance videos. Our work differs significantly from these
studies in that we exploit not only visual data without object
tracking, but also non-visual sources as complementary information.
The summary generated by our approach is semantically
enriched: it is labelled automatically with semantic tags,
e.g. traffic condition, weather, or event. All these tags
are learned from heterogeneous non-visual sources in an
unsupervised manner during model training without any
manual labels.

Random forests - Random forests [35], [16] have proven
to be powerful models in the literature. Different variants of
random forests have been devised, either supervised [36],
[37], [38], [39], [40], or unsupervised [41], [42], [43], [44],
[45]. Supervised models are not suitable for our problem
since we do not assume the availability of ground truth
labels during model training. Existing clustering forest
models, on the other hand, assume only homogeneous data
sources such as pure imagery-based features. No principled
way of combining multiple heterogeneous and independent
data sources in forest models is available.

2 Multi-Source Clustering 

Video summarisation by content abstraction aims to generate
a compact summary composed of key/interesting
content from a long previously-unseen video for achieving
efficient holistic understanding [32]. A common way
to establish a video summary is by extracting and then
combining a set of key frames or shots. These key contents
are usually discovered and selected from clusters of video
frames or clips [32].

In this study, we follow the aforementioned approach but
consider not only the visual content of video, but also a large
corpus of non-visual data collected from heterogeneous
independent sources (Fig. 2(a)). Specifically, through learning
the latent structure of multi-source data (Fig. 2(b-c)), we wish
to make reference to and/or impose non-visual semantics
directly into video clustering without any manual
annotation of video data (Fig. 2(d)). Formally, we consider
the following different data sources that form a multi-source
input feature space:

Visual features - We segment a training video into $N$ either
overlapping or non-overlapping clips, each of which has a
duration of $T_{clip}$ seconds. We then extract a $d$-dimensional
visual descriptor from the $i$th video clip, denoted by
$\mathbf{x}_i = (x_{i,1}, \dots, x_{i,d}) \in \mathbb{R}^d$, $i = 1, \dots, N$.

Non-visual data - Non-visual data are collected from heterogeneous
independent sources. We collectively represent
$m$ types of non-visual data associated with the $i$th clip as
$\mathbf{y}_i = (y_{i,1}, \dots, y_{i,m}) \in \mathbb{R}^m$, $i = 1, \dots, N$. Note that any
(or all) dimensions of $\mathbf{y}_i$ may be missing.

Fig. 2. Multi-source model training stage: The pipeline
of performing multi-source clustering on visual and
non-visual data with the proposed Multi-Source Clustering
Forest (MSC-Forest).


We aim to formulate a unified clustering model capable
of coping with the challenges highlighted earlier.
The model needs to be unsupervised since no ground truth is
assumed. To mitigate the heteroscedasticity and dimension
discrepancy problems, we require a model that can isolate
the very different characteristics of visual and non-visual
data, yet can still exploit their latent correlation in the
clustering process. To handle noisy data, feature selection
is also necessary.

In light of the above demands, we choose to start with
the clustering random forest [35], [41], [42] due to (1) its
unsupervised information gain optimisation, which requires
no ground truth labels; (2) its flexible objective function
for facilitating the modelling of multi-source data as well
as the processing of missing data; and (3) its implicit
feature selection mechanism for handling noisy features.
Nevertheless, the conventional clustering forest is not well
suited to solving these challenges since it expects a full
concatenated representation as input during both model
training and deployment. This does not conform to the
assumption of only visual data being available during model
deployment for previously-unseen videos. Moreover, due
to its uniform variable selection mechanism [35] (i.e. each
feature dimension has the same probability of being selected as
a candidate optimal splitting variable), there is no principled
way to ensure balanced contribution from individual visual
and non-visual sources in the node splitting process. To
overcome these limitations, we propose a new Multi-Source
Clustering Forest (MSC-Forest) by introducing a new objective
function allowing joint optimisation of individual
information gains of different sources. We first describe the
conventional forests prior to detailing the proposed MSC-Forest.


2.1 Conventional Random Forests 


Classification forests - A general form of random forests
is the classification forest. A classification forest [35], [46]
is an ensemble of $T_{class}$ binary decision trees
$\mathcal{T}(\mathbf{x}): \mathcal{X} \rightarrow \mathcal{R}^K$,
with $\mathcal{X}$ the $d$-dimensional feature space, and $\mathcal{R}^K =
[0, 1]^K$ denoting the space of class probability distributions
over the label space $C = \{1, \dots, K\}$.

Decision trees are learned independently of each other,
each with a random subset $X_t$ of the training samples $X =
\{\mathbf{x}_i\}$, i.e. bagging [35]. Growing a decision tree involves
a recursive node splitting procedure until some stopping
criterion is satisfied, e.g. leaf nodes are formed when no
further split can be achieved given the objective function,
or the number of training samples arriving at a node is
smaller than the predefined node size, $\phi$. A small $\phi$ leads to
deep trees. We set $\phi = 2$ in our experiments for capturing
sufficiently fine-grained data structure. At each leaf node,
the class probability distribution is then estimated based on
the labels of the arrival samples.

The training of each internal/split node is a process of
binary split function optimisation, defined as

$$h(\mathbf{x}, \vartheta) = \begin{cases} 0, & \text{if } x_{\vartheta_1} < \vartheta_2, \\ 1, & \text{otherwise.} \end{cases} \qquad (1)$$

This split function is parameterised by two parameters $\vartheta =
[\vartheta_1, \vartheta_2]$: (i) a feature dimension $x_{\vartheta_1}$ with $\vartheta_1 \in \{1, \dots, d\}$,
and (ii) a feature threshold $\vartheta_2 \in \mathbb{R}$. All samples of a split
node $s$ will be channelled to either the left $l$ or right $r$ child
node, according to the output of Eqn. (1).

The optimal split parameter $\vartheta^*$ is chosen via

$$\vartheta^* = \operatorname*{argmax}_{\Theta} \Delta I_{class}, \qquad (2)$$

where $\Theta = \{\vartheta^i\}_{i=1}^{m_{try}(|S|-1)}$ represents a parameter set over
$m_{try}$ randomly selected features, with $S$ the sample set
reaching the node $s$. The cardinality of a set is given by $|\cdot|$.
Typically, a greedy search strategy is exploited to identify
$\vartheta^*$. The information gain $\Delta I_{class}$ is formulated as


$$\Delta I_{class} = I_s - \frac{|L|}{|S|} I_l - \frac{|R|}{|S|} I_r, \qquad (3)$$

where $L$ and $R$ denote the sets of data routed into $l$ and $r$,
and $L \cup R = S$. The impurity $I$ can be computed
as either the entropy or the Gini impurity [47].
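
The node-splitting procedure of Eqns. (1)-(3) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, candidate thresholds are assumed to be midpoints between consecutive observed feature values, and sending samples with $x_{\vartheta_1} < \vartheta_2$ to the left child is an assumed convention.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split(samples, dim, thr):
    """Eqn. (1): route sample indices by comparing one feature to a threshold."""
    left = [i for i, x in enumerate(samples) if x[dim] < thr]
    right = [i for i, x in enumerate(samples) if x[dim] >= thr]
    return left, right


def info_gain(labels, left, right):
    """Eqn. (3) with the Gini impurity as the impurity measure."""
    n = len(labels)
    g_left = gini([labels[i] for i in left])
    g_right = gini([labels[i] for i in right])
    return gini(labels) - len(left) / n * g_left - len(right) / n * g_right


def best_split(samples, labels, dims):
    """Greedy search of Eqn. (2) over candidate dimensions and thresholds."""
    best = (None, None, -1.0)
    for d in dims:
        values = sorted({x[d] for x in samples})
        # Assumption: thresholds are midpoints between consecutive values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left, right = split(samples, d, thr)
            gain = info_gain(labels, left, right)
            if gain > best[2]:
                best = (d, thr, gain)
    return best
```

In a forest, `dims` would hold the $m_{try}$ randomly selected feature indices for the node being trained.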

Clustering forests - In contrast to classification forests,
clustering forests require no ground truth label information
during the training phase. A clustering forest consists of
$T_{clust}$ binary decision trees. The leaf nodes in each tree
define a spatial partitioning of the training data. Interestingly,
the training of a clustering forest can be performed
using the classification forest optimisation approach by
adopting the pseudo two-class algorithm [35], [41], [42].
Specifically, we add $N$ pseudo samples $\bar{\mathbf{x}} = \{\bar{x}_1, \dots, \bar{x}_d\}$
(Fig. 3(b)) into the original data space $X$ (Fig. 3(a)), with
$\bar{x}_i \sim \mathrm{Dist}(x_i)$ sampled from certain distributions $\mathrm{Dist}(x_i)$.
In the proposed model, we adopt the empirical marginal
distributions of the feature variables owing to their favourable
performance [42]. With this data augmentation strategy,
the clustering problem becomes a canonical classification
problem that can be solved by the classification forest
training method as discussed above. The key idea behind
this algorithm is to partition the augmented data space into
dense and sparse regions (Fig. 3(c-d)) [41].

Fig. 3. An illustration of clustering toy data with a clustering forest. (a) Original toy data are labelled as class
1, whilst (b) the pseudo-points (red +) are labelled as class 2. (c) A clustering forest performs two-class classification in the
augmented space. (d) The final data partitions on the original data.
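
As a rough illustration of the pseudo two-class augmentation, the sketch below draws each feature of a pseudo sample independently from the empirical marginal of that feature, i.e. from the values observed in that dimension. The function names and the fixed seed are illustrative assumptions, not part of the original method description.

```python
import random


def make_pseudo_samples(samples, rng):
    """Draw len(samples) pseudo points, one feature at a time.

    Each feature is sampled independently from the values observed in
    that dimension, which breaks the dependencies between features.
    """
    n = len(samples)
    d = len(samples[0])
    columns = [[x[k] for x in samples] for k in range(d)]
    return [tuple(rng.choice(columns[k]) for k in range(d)) for _ in range(n)]


def augment(samples, seed=0):
    """Return the augmented data with labels 1 (original) and 2 (pseudo)."""
    rng = random.Random(seed)
    pseudo = make_pseudo_samples(samples, rng)
    data = list(samples) + pseudo
    labels = [1] * len(samples) + [2] * len(pseudo)
    return data, labels
```

The returned `data` and `labels` can then be passed to any two-class tree training routine; regions dominated by label 1 correspond to dense areas of the original data.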


2.2 Multi-Source Clustering Forest 

Conventional clustering forests assume only homogeneous
data sources such as pure imagery-based features. In contrast,
the proposed Multi-Source Clustering Forest can take
heterogeneous sources as input. In particular, the proposed
model uses visual features as splitting variables to grow
Multi-Source Clustering trees (MSC-trees) as in Eqn. (1),
and exploits non-visual information as additional data to
help determine the split parameter $\vartheta = [\vartheta_1, \vartheta_2]$. In this way,
auxiliary non-visual information is used, in addition to visual data,
to guide the tree formation.

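Formally, we define a new joint information gain function
for node splitting when training MSC-trees as:

$$\Delta I = \underbrace{\alpha_v \frac{\Delta I_v}{I_{v0}}}_{\text{visual}} + \underbrace{\sum_{j=1}^{m} \alpha_j \frac{\Delta I_j}{I_{j0}}}_{\text{non-visual}} + \underbrace{\alpha_t \frac{\Delta I_t}{I_{t0}}}_{\text{temporal}}. \qquad (4)$$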

Similar to Eqn. (3), the optimal parameter corresponds to
the split with the maximal $\Delta I$. This formulation defines the
best data split across the joint space of multi-source data,
beyond the visual domain alone. All the terms in Eqn. (4) are
interpreted as below.

Visual term: $\Delta I_v = \Delta I_{class}$ (Eqn. (3)) denotes the
information gain in the visual domain. Precisely, this measure
is computed from the pseudo class labels. Therefore, it
reflects the visual data structure characteristics given that
the pseudo data samples are drawn from the marginal
feature distributions (Section 2.1). In this study we utilise
the Gini impurity $\mathcal{G}$ [47] to estimate $\Delta I_{class}$ by setting
$I = \mathcal{G}$ in Eqn. (3) due to its simplicity and efficiency. The
Gini impurity is computed as $\mathcal{G} = \sum_{i \neq j} p_i p_j$, with $p_i$ and
$p_j$ being the proportions of samples belonging to the $i$th and
$j$th categories in a split node $s$. A low value of $\mathcal{G}$ indicates a
pure category distribution.

Non-visual term: This is a new term we introduce as
auxiliary information on the visual term. More specifically,
$\Delta I_j$ denotes the information gain in the $j$th non-visual
data source. A non-visual source can be either categorical or
continuous. For a categorical non-visual source, similar to the
visual term, we use the Gini impurity $\mathcal{G}$ as its data split
criterion. In the case of a non-visual source with
continuous values, we adopt least squares regression [47]
to enforce continuity in the clustering space:

$$\mathcal{R} = \frac{1}{|S|} \sum_{i=1}^{|S|} \Big( y_{i,j} - \frac{1}{|S|} \sum_{i'=1}^{|S|} y_{i',j} \Big)^2, \qquad (5)$$

where $y_{i,j}$ represents the value in the $j$th non-visual space
associated with the $i$th sample $\mathbf{x}_i \in S$, and $S$ is the set of
samples reaching node $s$. That is, $\Delta I_j$ is obtained from Eqn. (3) with $I = \mathcal{R}$.
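
For a continuous non-visual source such as traffic speed, the least squares measure of Eqn. (5) is the variance of the source values at a node. The sketch below assumes that the corresponding gain is formed in the same way as Eqn. (3), with the variance taking the place of the impurity; the function names are hypothetical.

```python
def variance(values):
    """Eqn. (5): mean squared deviation from the mean at a node."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def continuous_gain(values, left, right):
    """Variance reduction achieved by routing indices to left/right children."""
    n = len(values)
    v_left = variance([values[i] for i in left])
    v_right = variance([values[i] for i in right])
    return variance(values) - len(left) / n * v_left - len(right) / n * v_right
```

A split that separates slow-traffic clips from fast-traffic clips yields a large gain, whereas a split that mixes them yields a gain close to zero.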

Temporal term: We also add a temporal smoothness gain
$\Delta I_t$ to encourage temporally adjacent video clips to be
grouped together. This temporal information helps in mining
the visual data structure.

The information gains of different sources may lie
in very disparate ranges due to the different natures of the
sources. Each term of Eqn. (4) is therefore normalised by its
initial data impurity, denoted by $I_{v0}$, $I_{j0}$, and $I_{t0}$. These
impurities are obtained at the root node of every MSC-tree.
The source weights are denoted by $\alpha_v$, $\alpha_j$ and $\alpha_t$
accordingly, with $\alpha_v + \sum_{j=1}^{m} \alpha_j + \alpha_t = 1$. We set
$\alpha_v = 0.5$, obtained by cross-validation. A detailed analysis
of $\alpha_v$ is given in Section 5.2. For non-visual and temporal
information, we uniformly assign $\alpha_t = \alpha_j = \frac{1 - \alpha_v}{m + 1}$ since
their importance is not known a priori, with $m$ the number
of non-visual sources.
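
The weighted, normalised combination of Eqn. (4) can be sketched as follows. The per-source gains and root impurities are assumed to have been computed elsewhere, all root impurities are assumed to be strictly positive, and the uniform split of the remaining weight over the $m$ non-visual sources and the temporal term follows from the constraint that all weights sum to one. All names are illustrative.

```python
def uniform_weights(alpha_v, m):
    """Share the remaining weight equally among m non-visual sources and time."""
    share = (1.0 - alpha_v) / (m + 1)
    return [share] * m, share


def joint_gain(gain_v, gains_nv, gain_t,
               root_v, roots_nv, root_t,
               alpha_v, alphas_nv, alpha_t):
    """Eqn. (4): each gain is divided by its root-node impurity, then weighted.

    Assumes every root impurity is strictly positive.
    """
    total = alpha_v * gain_v / root_v
    for alpha, gain, root in zip(alphas_nv, gains_nv, roots_nv):
        total += alpha * gain / root
    total += alpha_t * gain_t / root_t
    return total
```

Because every term is a ratio to its own root impurity, a source measured on a large numeric scale does not dominate the split selection merely because of its units.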

The role of different source data - Given the main role and
much more stable provision of the visual source in video
understanding, non-visual data are regarded as auxiliary
information to the visual source. During the training of
MSC-Forest, the split functions (Eqn. (1)) are defined on
visual features, but $\vartheta = [\vartheta_1, \vartheta_2]$ is collectively determined
by visual features and the associated non-visual as well as
temporal information (i.e. the non-visual and temporal terms
in Eqn. (4)). Alternatively, one can think of the main
visual data source as ‘completely-visible’ to the MSC-Forest
since it is needed during both forest training and evaluation,
whilst the auxiliary non-visual data are ‘half-visible’ in
that they are exploited as side information for embedding
their knowledge into the MSC-tree growing during model
training but are no longer required during MSC-Forest
evaluation (due to their restricted availability, as explained
earlier).

Joint information gain - We interpret the intrinsic advantage
of the joint information gain defined by Eqn. (4) by
comparison with the naive feature concatenation strategy.
With the latter scheme, the information gain (Eqn. (3)) is
directly estimated in a heterogeneous joint space where
visual, non-visual and temporal data are mixed together.
This would suffer from the heteroscedasticity problem
discussed earlier. Instead, Eqn. (4) overcomes
this challenge by modelling different sources via separate
information gain terms, resulting in a more balanced
exploitation of multi-source data. In this way, the proposed
joint information gain of multi-source data encourages
more appropriate visual data separation both visually and
semantically. This formulation is the essential contribution
of our proposed MSC-Forest model.

The merits of MSC-Forest - The formulation in Eqn. (4)
brings two unique benefits: (A) Thanks to the information
gain optimisation, the influences of visual and non-visual
domains on data partitioning can be better balanced
compared to naive feature concatenation. (B) Eqn. (2) and
Eqn. (4) together provide a mechanism to discover strongly
correlated heterogeneous source pairs and to exploit the joint
information gain of such correlated pairs for data partitioning.
In other words, only selective visual features
(Eqn. (2)) that yield high information gain collectively with
non-visual information (Eqn. (4)) will contribute to the
MSC-tree growing. Such a mechanism cannot be realised
using the conventional clustering forests [35], [41]. We shall
demonstrate the multi-source correlation discovered by our
proposed MSC-Forest in experiments (Section 5.4).

2.2.1 Coping with Partial/Missing Non-Visual Data
We introduce a new adaptive weighting mechanism to
dynamically deal with the inevitable partial/missing non-visual
data 3 . Specifically, when some non-visual data are
missing, and supposing the missing proportion of the $i$th non-visual
type in the training set for MSC-tree $t$ is $\delta_i$, we
reduce its weight from $\alpha_i$ to $\alpha_i - \delta_i \alpha_i$. The total reduced
weight $\sum_i \delta_i \alpha_i$ is then distributed evenly to the weights of
all sources to ensure $\alpha_v + \sum_{i=1}^{m} \alpha_i + \alpha_t = 1$. This linear
adaptive weighting method produces satisfactory results in
our experiments.
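
The adaptive weighting rule can be sketched as below. The text leaves open exactly which weights receive the redistributed mass; this sketch assumes it is shared equally among all $m + 2$ weights (visual, every non-visual source, and temporal), which means a fully missing source still receives a small share. The function name and argument layout are hypothetical.

```python
def adapt_weights(alpha_v, alphas_nv, alpha_t, missing):
    """Reduce each non-visual weight by its missing proportion, then rebalance.

    missing[i] is the proportion (between 0 and 1) of missing values of the
    i-th non-visual source in the training set of the current tree.
    """
    reduced = [a * (1 - d) for a, d in zip(alphas_nv, missing)]
    freed = sum(a * d for a, d in zip(alphas_nv, missing))
    # Assumption: the freed weight is shared equally by all m + 2 weights.
    share = freed / (len(alphas_nv) + 2)
    return alpha_v + share, [a + share for a in reduced], alpha_t + share
```

With no missing data the weights are returned unchanged, and in every case the adapted weights still sum to one.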

3. There exist missing data filling algorithms utilised in conventional
random forests, e.g. for the missing value of one feature in one class, the
median value (continuous) or the most frequent category (discrete) of this
feature over the current class can be used as the estimation [48]. Whilst
a similar strategy could be applied to our MSC-Forest, we consider
an alternative by proposing an effective adaptive weighting algorithm in
order not to further introduce noisy training data.

2.2.2 Model Complexity

The upper-bound learning complexity of a whole MSC-Forest
can be examined from its constituent parts, i.e. at
tree- and node-levels. Formally, given a MSC-tree $t$, we
denote the set of all the split nodes as $\mathcal{U}_t$ and the sample
subset used for training a split node $j \in \mathcal{U}_t$ as $S_j$. The
training complexity of the $j$th node is given by $m_{try}(|S_j| -
1)u$ when a greedy search algorithm is adopted, with $m_{try}$
the number of features attempted to partition $S_j$, and $u$ the
running time of conducting one data splitting operation.
Consequently, the overall computational cost of learning a
MSC-Forest can be computed as

$$\sum_t \sum_{j \in \mathcal{U}_t} m_{try} (|S_j| - 1) u = m_{try} u \sum_t \sum_{j \in \mathcal{U}_t} (|S_j| - 1). \qquad (6)$$

The value of parameter $m_{try}$ is identical across all MSC-trees.
The learning time is thus determined by (1) the value
of $u$, and (2) the factor that we name as tree fan-in

$$\mathcal{D}(t) = \sum_{j \in \mathcal{U}_t} (|S_j| - 1). \qquad (7)$$

Clearly, $u$ of a MSC-Forest is larger than that of conventional
forests since we need to compute additional
information gains of non-visual and temporal information
(Eqn. (4)). On the other hand, the value of $\mathcal{D}(t)$ primarily
relies on the tree structure/topological characteristics [49]:
a balanced and shallower tree has a smaller $\mathcal{D}(t)$, thus the
tree will be more efficient in training and inference on
previously-unseen samples, in that the paths from the root
to leaf nodes are relatively shorter. In Section 5.5, we will
show that the additional non-visual information encourages
more balanced and shallower decision trees than learning
from a single visual source alone.
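
The effect of tree shape on training cost can be illustrated numerically. The sketch below takes the per-tree sum of $(|S_j| - 1)$ over split nodes, as it appears in Eqn. (6), as the tree fan-in; the node sizes in the usage note are invented toy values, not measurements from the paper.

```python
def tree_fan_in(split_node_sizes):
    """Sum of (|S_j| - 1) over the split nodes of one tree."""
    return sum(size - 1 for size in split_node_sizes)


def forest_cost(trees, m_try, u):
    """Eqn. (6): m_try * u times the total fan-in over all trees."""
    return m_try * u * sum(tree_fan_in(sizes) for sizes in trees)
```

For eight training samples, a perfectly balanced tree has split nodes of sizes 8, 4, 4, 2, 2, 2, 2, giving a fan-in of 17, whereas a degenerate chain that peels off one sample per split has sizes 8, 7, 6, 5, 4, 3, 2 and a fan-in of 28.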

2.3 Latent Multi-Source Data Structure Discovery 

The multi-source feature space has a high dimension (over
2000 dimensions). This makes learning data structure by
clustering computationally difficult. To address this, we
consider spectral clustering on manifold to discover latent
clusters in a lower dimensional space. Fig. 2 depicts the
pipeline of our video data clustering approach based on the
learned MSC-Forest.

Spectral clustering [50] groups data using the eigenvectors
of an affinity matrix derived from the data. The
goodness of the resulting cluster formation primarily relies
on the quality of the input affinity matrix, which reflects
and embeds the essential data structures [45]. Below we
describe the details of constructing a multi-source referenced
affinity matrix from MSC-Forest. Intuitively, the multi-source
learning nature of MSC-Forest renders its data
similarity measure sensitive to the joint knowledge from
diverse source data.

The learned MSC-Forest offers an effective way to derive
the required affinity matrix. Specifically, each individual
tree within the MSC-Forest partitions the training samples
at its leaves $\ell(\mathbf{x}): \mathbb{R}^d \rightarrow L \subset \mathbb{N}$, where $\ell$ represents a leaf
index and $L$ refers to the set of all leaves in a given tree. For
each MSC-tree, we first compute a tree-level $N \times N$ affinity
matrix $A^t$ with elements defined as $A^t_{i,j} = \exp(-\mathrm{dist}^t(\mathbf{x}_i, \mathbf{x}_j))$,
where

$$\mathrm{dist}^t(\mathbf{x}_i, \mathbf{x}_j) = \begin{cases} 0, & \text{if } \ell(\mathbf{x}_i) = \ell(\mathbf{x}_j), \\ +\infty, & \text{otherwise.} \end{cases} \qquad (8)$$

We assign the maximum affinity (affinity=l, distance=0) 
between points and Xj if they fall into the same leaf, and 
the minimum affinity (affinity=0, distance=l) otherwise. A 
smooth affinity matrix can be obtained through averaging 
all the tree-level affinity matrices 


$$A = \frac{1}{T_{clust}} \sum_{t=1}^{T_{clust}} A^t. \tag{9}$$


Eqn. (9) is adopted as the ensemble model of MSC- 
Forest due to its advantage of suppressing the noisy tree 
predictions, though other alternatives such as the product 
of tree-level predictions are possible [16]. We then construct 
a sparse k-NN graph, whose edge weights are defined by 
A (Fig. 2(c)). 

Subsequently, we symmetrically normalise $A$ to obtain
$S = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$, where $D$ denotes a diagonal degree
matrix with elements $D_{i,i} = \sum_{j=1}^{N} A_{i,j}$. Given $S$, we
perform spectral clustering to discover the latent clusters of
training clips, with the number of clusters automatically
determined through analysing the eigenvector structure [50].
Each training clip $\mathbf{x}_i$ is then assigned to a cluster $c_i \in C$,
with $C$ the set of all clusters.
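
The affinity construction of Eqns. (8) and (9) and the subsequent normalisation can be illustrated with a minimal Python sketch. It assumes the forest has already been trained and that a hypothetical array `leaf_ids` stores, for each tree, the leaf index reached by each training sample. The sparse k-NN graph step is omitted for brevity, so this is an illustration rather than the exact pipeline.

```python
import numpy as np

def forest_affinity(leaf_ids: np.ndarray) -> np.ndarray:
    """Average the tree-level affinities of Eqn. (8) as in Eqn. (9).

    leaf_ids has shape (n_trees, n_samples); entry [t, i] is the leaf of
    tree t that sample i falls into (hypothetical input format).
    """
    n_trees, n_samples = leaf_ids.shape
    affinity = np.zeros((n_samples, n_samples))
    for tree_leaves in leaf_ids:
        # exp(-0) = 1 for a shared leaf, exp(-inf) = 0 otherwise.
        affinity += (tree_leaves[:, None] == tree_leaves[None, :]).astype(float)
    return affinity / n_trees

def normalise_affinity(affinity: np.ndarray) -> np.ndarray:
    """Symmetric normalisation S = D^{-1/2} A D^{-1/2}."""
    degree = affinity.sum(axis=1)  # at least 1, since the diagonal of A is 1
    inv_sqrt = 1.0 / np.sqrt(degree)
    return affinity * inv_sqrt[:, None] * inv_sqrt[None, :]

# Toy forest: 2 trees, 4 samples.
leaf_ids = np.array([[0, 0, 1, 1],
                     [0, 1, 1, 2]])
A = forest_affinity(leaf_ids)
S = normalise_affinity(A)
```

Samples 0 and 1 share a leaf in one of the two trees, so their averaged affinity is 0.5, whereas samples 0 and 2 never share a leaf and receive affinity 0.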

The learned clusters group similar clips both visually and
semantically, with each of the clusters associated with a
unique distribution for each non-visual data type (Fig. 2(d)). We
denote the distribution of the $i$-th non-visual data type of
the cluster $c$ as

$$p(y_i \mid c) \propto \sum_{\mathbf{x}_j \in X_c} p(y_i \mid \mathbf{x}_j), \tag{10}$$

where $X_c$ represents the set of training samples in $c$. These
multi-source data clusters form a component of our
multi-source model (Fig. 1).
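
As a small worked example of Eqn. (10), the sketch below sums per-sample tag distributions within each cluster and normalises the result. The function name and the one-hot encoding of observed training tags are illustrative assumptions, not details given in the text.

```python
import numpy as np

def cluster_tag_distribution(sample_tag_probs, cluster_labels, n_clusters):
    """Return rows p(y_i | c), one per cluster, following Eqn. (10).

    sample_tag_probs: (n_samples, n_tag_values) rows p(y_i | x_j).
    cluster_labels:   (n_samples,) integer cluster index of each sample.
    """
    dist = np.zeros((n_clusters, sample_tag_probs.shape[1]))
    # Accumulate the per-sample distributions of every cluster member.
    np.add.at(dist, cluster_labels, sample_tag_probs)
    totals = dist.sum(axis=1, keepdims=True)
    # Normalise each row; empty clusters are left as all zeros.
    return np.divide(dist, totals, out=np.zeros_like(dist), where=totals > 0)

# Toy data: weather tag with values (sunny, cloudy, rainy) = (0, 1, 2).
tags = np.array([0, 0, 1, 2, 2])
one_hot = np.eye(3)[tags]             # assumed encoding of observed tags
labels = np.array([0, 0, 0, 1, 1])
p_tag_given_cluster = cluster_tag_distribution(one_hot, labels, n_clusters=2)
```

Here the first cluster holds two sunny clips and one cloudy clip, giving the distribution (2/3, 1/3, 0), while the second cluster is purely rainy.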


3 Semantic Video Summarisation 

In Section 2 we presented multi-source data clustering by
learning a Multi-Source Clustering Forest (MSC-Forest),
resulting in a semantic cluster formation. Once this
multi-source model is learned, it can be deployed for semantic
video summarisation. Specifically, we follow the established
approach of summarising videos by clustering [32],
but with two notable differences in our method.

First, our video summary is multi-source referenced.
Because the MSC-Forest is trained on heterogeneous
sources, its optimised split functions $\{h\}$ (Eqn. (1))
implicitly capture the complex multi-source structures.
When one deploys the trained model for content summarisation
of previously-unseen video data, the model only
needs visual inputs, without any non-visual data
sources. Yet it is able to induce video content partitions
that not only correspond to visual feature similarities, but
are also consistent with meaningful non-visual semantic
interpretations. Second, our video summary is automatically
tagged as the result of model inference. This is made
possible through exploiting the non-visual data distributions
associated with the discovered clusters on the training data
(see Eqn. (10) and Fig. 2(d)). Below we discuss the details
of generating a semantic video summary.


3.1 Key-Clip Extraction and Composition 

Suppose we are given previously-unseen surveillance
video footage without meta-data tagging/script. The video
is pre-processed by segmenting it into a set of $M$ either
overlapping or non-overlapping short clips $\{\mathbf{x}^*_i\}_{i=1}^{M}$ with
equal duration. Our aim is to first assign cluster membership
to each previously-unseen clip using the trained
multi-source model, and then select key-clips from the resulting
clusters$^4$. The chosen key-clips are then chronologically
ordered to construct a video summary.

Clustering previously-unseen video clips - Inferring cluster
memberships of previously-unseen clips is an intricate
task. A straightforward method is to assign cluster
membership by identifying the nearest cluster $c^* \in C$ to
a sample $\mathbf{x}^*$, where $C$ represents the set of clusters we
discovered in Section 2.3. However, we found this hard
cluster assignment strategy susceptible to outliers in $C$
and source noise. To mitigate this problem, we consider
an alternative approach by utilising the MSC-Forest tree
structures for soft cluster assignment. This is more robust
to both source noise and outliers.

Fig. 4 depicts the soft cluster assignment pipeline. First,
we trace the leaf $\ell_t(\mathbf{x}^*)$ of each tree $t$ where $\mathbf{x}^*$ falls by
channelling $\mathbf{x}^*$ into the tree (Fig. 4(a)). This step is critical
as it establishes a connection for $\mathbf{x}^*$ with an appropriate
training subset $X_{\ell_t(\mathbf{x}^*)}$ using the split functions $\{h\}_t$
optimised by multi-source data. Here, $X_{\ell_t(\mathbf{x}^*)}$ represents the
set of training samples associated with $\ell_t(\mathbf{x}^*)$. The set is
consistent with $\mathbf{x}^*$ both visually and semantically since they
yield identical responses w.r.t. $\{h\}_t$.

Second, we retrieve the cluster membership $C_t = \{c_i\} \subset C$
of $X_{\ell_t(\mathbf{x}^*)}$, against which we search for the tree-level
nearest cluster $c^*_t$ for $\mathbf{x}^*$ (Fig. 4(b)) via

$$c^*_t = \arg\min_{c \in C_t} \|\mathbf{x}^* - \mu_c\|, \tag{11}$$

with $t$ the tree index, and $\mu_c$ the centroid of cluster $c$,
estimated as

$$\mu_c = \frac{1}{|X_c|} \sum_{\mathbf{x}_i \in X_c} \mathbf{x}_i, \tag{12}$$

where $X_c$ represents the set of training samples in $c$.
Performing nearest cluster search within $C_t$ rather than the
whole cluster space $C$ brings a key benefit: since the search
space is constrained by the MSC-tree, it is more meaningful
and also less noisy than the entire space $C$, leading to more
accurate $c^*_t$ estimation.


4. It is worth noticing that the purpose of this clustering step is 
completely different from the multi-source data clustering during model 
training, as presented in Section 2.3. The latter is a component of our 
multi-source model training pipeline (Fig. 2), whilst the former aims at 
revealing the latent structure over testing data for video summarisation. 
Fig. 4. The pipeline of our multi-source referenced key-clips detection algorithm. (a) Channel a clip $\mathbf{x}^*$ into MSC-trees. (b) Search tree-level
nearest clusters of $\mathbf{x}^*$; a hollow circle denotes a cluster. (c) Predict the final nearest cluster. A red * depicts a representative previously-unseen
clip.


Once we obtain all tree-level nearest clusters from all the
trees in the forest, $\{c^*_t\}_{t=1}^{T_{clust}}$, the final nearest cluster $c^*$ is
obtained as the one with maximal votes from all the trees
(Fig. 4(c))

$$c^* = \arg\max_{c \in C} \left|\{t : c^*_t = c\}\right|. \tag{13}$$

By repeating the above steps on all previously-unseen clips
we obtain their cluster labels as CC $= \{c^*_i\}_{i=1}^{M}$
(Fig. 4(e)).
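
The soft cluster assignment of Eqns. (11) to (13) can be sketched as follows. The sketch assumes that tracing a clip through each tree has already produced, per tree, the list of candidate clusters found in the reached leaf (the hypothetical `candidates_per_tree`); all function names are illustrative.

```python
from collections import Counter
import numpy as np

def cluster_centroids(train_x, train_labels):
    """Eqn. (12): mean of the training samples in each cluster."""
    return {int(c): train_x[train_labels == c].mean(axis=0)
            for c in np.unique(train_labels)}

def tree_level_nearest(x, candidates, centroids):
    """Eqn. (11): nearest centroid among the candidate clusters C_t of one tree."""
    return min(candidates, key=lambda c: np.linalg.norm(x - centroids[c]))

def soft_assign(x, candidates_per_tree, centroids):
    """Eqn. (13): the cluster with the most tree-level votes."""
    votes = [tree_level_nearest(x, c_t, centroids) for c_t in candidates_per_tree]
    return Counter(votes).most_common(1)[0][0], votes

train_x = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]])
train_labels = np.array([0, 0, 1, 1, 2])
centroids = cluster_centroids(train_x, train_labels)

# Hypothetical candidate clusters reached in three trees for one unseen clip.
candidates_per_tree = [[0, 1], [1, 2], [0, 2]]
final_cluster, votes = soft_assign(np.array([4.0, 4.0]), candidates_per_tree, centroids)
```

In this toy case the third tree cannot vote for cluster 1 because that cluster is absent from its leaf, yet the majority vote still recovers cluster 1, which shows how individual noisy trees are tolerated.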

Extracting key-clips - With the assigned cluster memberships
CC on all previously-unseen clips, the key-clip of a
previously-unseen video data cluster $c^*$ can be represented
by the representative previously-unseen clip $r^*$ that is
closest to the cluster centroid $\mu_{c^*}$ (Fig. 4(e)). Concatenating
these key-clips chronologically establishes a visual
summary. Such a summary, however, is likely to be visually
non-smooth owing to abrupt changes between adjacent key-clips.
To enforce some degree of smoothness in the visualisation
of the video summary whilst minimising redundancy, we adopt
a shortest path strategy [51] to induce an optimal path
between two temporally-adjacent representatives $r^*$ on a
graph $G$. This approach produces a visually more coherent
video summary whilst discarding as much redundancy as
possible.

More precisely, we construct a graph $G = (V, E)$, where
$V$ and $E$ indicate the sets of previously-unseen video clip
vertices and edges (Fig. 4(d)). The weights of edges can be
efficiently estimated using Eqns. (8) and (9). Note that the
graph $G$ is also multi-source referenced since it is derived
from our multi-source MSC-Forest model. We then perform
shortest path search between temporally-adjacent $r^*$ on $G$
(Fig. 4(f)), and all the samples that lie on the shortest paths
compose the final key-clip set $\mathcal{K}$ (Fig. 4(g)).
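
A minimal sketch of this composition step is given below. It assumes clips are indexed in temporal order and that an edge cost of 1 minus the forest affinity is used, with zero-affinity pairs left unconnected. The text does not specify the exact cost, so both the cost and the function names are illustrative assumptions.

```python
import heapq
import numpy as np

def shortest_path(cost, source, target):
    """Dijkstra on a dense cost matrix; np.inf marks a missing edge."""
    n = cost.shape[0]
    dist = [float("inf")] * n
    prev = [None] * n
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist[u]:
            continue  # stale heap entry
        for v in range(n):
            nd = d + cost[u, v]
            if nd < dist[v]:
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if dist[target] == float("inf"):
        return [source, target]  # unreachable: keep only the two representatives
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

def compose_key_clips(affinity, representatives):
    """Union of shortest paths between temporally-adjacent representatives."""
    cost = np.where(affinity > 0, 1.0 - affinity, np.inf)  # assumed edge cost
    reps = sorted(representatives)
    key_clips = set(reps)
    for a, b in zip(reps, reps[1:]):
        key_clips.update(shortest_path(cost, a, b))
    return sorted(key_clips)  # chronological order

affinity = np.array([[1.0, 0.9, 0.0, 0.0, 0.0],
                     [0.9, 1.0, 0.8, 0.0, 0.0],
                     [0.0, 0.8, 1.0, 0.1, 0.7],
                     [0.0, 0.0, 0.1, 1.0, 0.2],
                     [0.0, 0.0, 0.7, 0.2, 1.0]])
key_clips = compose_key_clips(affinity, representatives=[0, 4])
```

With representatives 0 and 4, the path passes through the strongly connected clips 1 and 2 and skips the weakly connected clip 3, which is the intended trade-off between smoothness and redundancy.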


Algorithm 1: Infer non-visual tags of previously-unseen clips.

Input: A previously-unseen clip $\mathbf{x}^*$, a trained
       MSC-Forest, training data clusters $C$;
Output: Predicted tag $\hat{y}_i$;

Initialisation:
    Compute $p(y_i \mid c)$ for each training data cluster (Eqn. (10));
    Compute cluster centroid $\mu_c$ (Eqn. (12));
Non-Visual Tag Inference:
    for $t \leftarrow 1$ to $T_{clust}$ do
        Trace the leaf $\ell_t(\mathbf{x}^*)$ where $\mathbf{x}^*$ falls (Fig. 4(a));
        Retrieve the training samples $X_{\ell_t(\mathbf{x}^*)}$ associated with $\ell_t(\mathbf{x}^*)$;
        Obtain the clusters $C_t = \{c_i\} \subset C$ of $X_{\ell_t(\mathbf{x}^*)}$;
        Search the tree-level nearest cluster $c^*_t$ of $\mathbf{x}^*$ within $C_t$ (Eqn. (11));
    end
    Estimate tag distribution $p(y_i \mid \mathbf{x}^*)$ (Eqn. (14));
    Compute the final tag $\hat{y}_i$ (Eqn. (15)).


3.2 Video Tagging 

Summarising video with high-level interpretation requires
plausible semantic content inference from video data $\mathbf{x}^*$.
We derive a tree-structure aware tag inference algorithm
capable of predicting the same tag types as the training
non-visual data, based on the learned MSC-Forest and discovered
training data clusters. Specifically, we first obtain the
tree-level nearest cluster $c^*_t$ of a previously-unseen sample $\mathbf{x}^*$
using Eqn. (11). Second, the $p(y_i \mid c^*_t)$ associated with $c^*_t$ is
utilised as the tree-level non-visual tag estimation for the
$i$-th non-visual data type. To achieve a smooth prediction,
we average all $p(y_i \mid c^*_t)$ obtained from individual trees
as

$$p(y_i \mid \mathbf{x}^*) = \frac{1}{T_{clust}} \sum_{t=1}^{T_{clust}} p(y_i \mid c^*_t). \tag{14}$$

The final tag $\hat{y}_i$ for the $i$-th non-visual type is obtained as

$$\hat{y}_i = \arg\max_{y_i} p(y_i \mid \mathbf{x}^*). \tag{15}$$

With the above steps, we can estimate all $m$ non-visual
tags $\hat{y}_i$ with $i \in \{1, \ldots, m\}$. The procedure of our tagging
algorithm is summarised in Algorithm 1.

Given the extracted key-clips $\mathcal{K}$ and automatic assignment
of non-visual semantic tags (Eqn. (15)), we can now
construct a video summary by chronologically concatenating
each clip $\mathbf{x}^* \in \mathcal{K}$ with smooth inter-clip transition,
e.g. crossfading, and labelling each clip with its inferred
semantic tags.
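
The averaging and arg-max steps of Eqns. (14) and (15) reduce to a few lines once the tree-level nearest clusters are known. The sketch below assumes a hypothetical table `cluster_tag_probs` whose rows are the cluster-level distributions of Eqn. (10); the numbers are toy values chosen for illustration.

```python
import numpy as np

def infer_tag(tree_level_clusters, cluster_tag_probs, tag_names):
    """Average p(y_i | c_t*) over trees (Eqn. (14)) and take the arg-max (Eqn. (15))."""
    probs = np.mean([cluster_tag_probs[c] for c in tree_level_clusters], axis=0)
    return tag_names[int(np.argmax(probs))], probs

# Rows: clusters 0..2; columns: p(sunny | c), p(cloudy | c), p(rainy | c).
cluster_tag_probs = np.array([[0.8, 0.2, 0.0],
                              [0.1, 0.6, 0.3],
                              [0.0, 0.1, 0.9]])
tag_names = ["sunny", "cloudy", "rainy"]

# Tree-level nearest clusters returned by three trees for one unseen clip.
tag, probs = infer_tag([1, 1, 0], cluster_tag_probs, tag_names)
```

Two of the three trees point to the mostly cloudy cluster, so the averaged distribution peaks at "cloudy" while retaining some mass on "sunny" from the dissenting tree.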

4 Experimental Settings 

Datasets - We conducted experiments on two datasets
collected from publicly accessible webcams that feature an
outdoor and an indoor scene respectively: (1) the Times
Square Intersection (TISI) dataset, and (2) the Educational
Resource Centre (ERCe) dataset$^5$. There are a total of 7324
video clips spanning 14 days in the TISI dataset, whilst
a total of 13817 clips were collected across a period of two
months in the ERCe dataset. Each clip has a duration of 20
seconds. The details of the datasets and training/deployment
partitions are given in Table 1. Example frames are shown
in Fig. 5.

The TISI dataset is challenging due to severe inter-object
occlusion, complex behaviour patterns, and large
illumination variations caused by both natural and artificial
light sources at different times of day. The ERCe dataset is
non-trivial due to the wide range of physical events involved, which
are characterised by large changes in environmental setup,
participants, crowdedness, and intricate activity patterns.

TABLE 1

Details of datasets. FPS = frames per second.

Dataset   Resolution   FPS   # Training clips   # Deployment clips
TISI      550 x 960    10    5819               1505
ERCe      480 x 640    5     9387               4430


Visual and non-visual sources - We extracted the following
set of visual features for representing visual content
in each clip: (a) colour features including RGB and HSV;
(b) local texture features based on Local Binary Pattern
(LBP) [52]; (c) optical flow; (d) holistic features of the
scene based on GIST [53]; and (e) person and vehicle$^6$
detection [54].

We collected 10 types of non-visual sources for the TISI
dataset: (a) weather data extracted from WorldWeatherOnline
with 9 elements: temperature, weather type, wind
speed, wind direction, precipitation, humidity, visibility,
pressure, and cloud cover; (b) traffic speed data from
Google Maps with 4 levels of traffic speed: very slow, slow,
moderate, and fast. For the ERCe dataset, we collected data
from multiple independent on-line sources about the timetable
of campus events including: No Scheduled Event (No
Schd. Event), Cleaning, Career Fair, Forum on Gun Control
and Gun Violence (Gun Forum), Group Studying, Scholarship
Competition (Schlr. Comp.), Accommodative Service
(Accom. Service), Student Orientation (Stud. Orient.).

5. Datasets available: www.eecs.qmul.ac.uk/%7Exz303/download.html

6. No vehicle detection on the ERCe dataset.

Note that other visual features and non-visual data types
can be considered without altering the training and inference
methods of our model, in that the MSC-Forest model is
capable of coping with different families of visual features
as well as distinct types of non-visual sources.

Baselines - To evaluate the proposed method for multi-source
video clustering and tag inference, we compared
the Visual + Non-Visual + MSC-Forest (VNV-MSC-Forest)
model against the following baseline models:

1) VO-Forest: a conventional forest [35] trained with visual
   feature vectors alone, to demonstrate the benefits
   from using non-visual sources$^7$.

2) VNV-Kmeans: k-means using concatenated vectors of
   visual and non-visual features, to highlight the
   heteroscedasticity and dimensionality discrepancy problem
   caused by heterogeneous visual and non-visual
   data.

3) VNV-Forest: a conventional forest [35] trained with
   concatenated visual and non-visual feature vectors,
   to compare the effectiveness of MSC-Forest that
   exploits non-visual data during forest formation.

4) VNV-AASC: a state-of-the-art multi-source spectral
   clustering method [15] learned by treating each type
   of visual or non-visual feature as an individual source,
   to demonstrate the superiority of MSC-Forest in
   handling diverse data representations and correlating
   multiple sources.

5) VNV-MSC-Forest-hard: a variant of our model using
   the hard cluster assignment strategy for inferring semantic
   tags of previously-unseen samples (Section 3.2),
   to highlight the effectiveness of the proposed tree
   structure based tag inference algorithm.

6) VT-MSC-Forest: a variant of our model using only
   temporal information and visual data. In order to
   show the exact effectiveness of exploiting non-visual
   data, the weight ratio between visual data and time
   remains the same as in VNV-MSC-Forest, with the only
   difference being that non-visual data is discarded during
   model training.

7) VPNVp-MSC-Forest: a variant of our model but with
   p% of training samples having an arbitrary number of
   missing non-visual types, to evaluate the robustness
   of MSC-Forest in coping with partial/missing
   non-visual data.

Implementation details - The clustering forest size $T_{clust}$
was set to 1000 for both the conventional forest
and the proposed MSC-Forest. We observed a slight increase
in performance given a larger forest size, which
agrees with [16]. The training set $X^t$ of the $t$-th MSC-tree
was obtained by performing random selection with
replacement from the augmented data space (Fig. 3(b)). We
set $m_{try} = \sqrt{d}$, with $d$ the data feature dimension (Eqn. (2)),
as is common practice [35]. We employed linear data
separation [16] as the test function for node splitting. We set
the same number of clusters across all methods. This cluster
number was discovered automatically using the method
presented in [50]. For each dataset, approximately 75% of the total
data was utilised for model training, and the remainder was
reserved for testing. Additional previously-unseen video
data was collected from the Times Square Intersection scene
on a separate day for video summarisation.

7. Evaluating a forest that takes only non-visual inputs is not possible,
since non-visual data is not available for previously-unseen video footage.

Fig. 5. Examples of the (a) TISI and (b) ERCe datasets.

5 Evaluations 

5.1 Multi-Source Clustering 

To evaluate the effectiveness of different clustering models
for multi-source video clustering, we compared the quality
of their clusters formed on the training dataset. For
determining clustering quality, we quantitatively measured
the mean entropy [55] of non-visual distributions $p(y_i \mid c)$
(Eqn. (10)) associated with training data clusters to evaluate
how coherently video content is partitioned, assuming all
methods have access to non-visual data during the entropy
computation.
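
For concreteness, a mean-entropy purity score of this kind could be computed as in the sketch below. The unweighted average over clusters and the natural logarithm are assumptions made here for illustration; the exact definition follows [55] and may differ in weighting or log base.

```python
import numpy as np

def mean_cluster_entropy(cluster_tag_probs):
    """Average Shannon entropy of the rows p(y_i | c); lower means purer clusters.

    Assumes an unweighted mean over clusters and the natural logarithm.
    """
    p = np.asarray(cluster_tag_probs, dtype=float)
    # Treat 0 * log(0) as 0 by leaving the log at 0 wherever p is 0.
    logs = np.log(p, out=np.zeros_like(p), where=p > 0)
    return float(np.mean(-(p * logs).sum(axis=1)))

pure_clusters = [[1.0, 0.0], [0.0, 1.0]]
mixed_clusters = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
pure_score = mean_cluster_entropy(pure_clusters)
mixed_score = mean_cluster_entropy(mixed_clusters)
```

Perfectly pure clusters score 0, while a cluster split evenly between two tag values contributes log 2 to the average.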

It is evident from Table 2 that our VNV-MSC-Forest
achieves the best cluster purity on the weather and event tags,
and is within 0.01 of the lowest entropy on traffic speed$^8$.
Although there are gradual degradations in clustering quality

8. VNV-MSC-Forest-hard shares the same clusters as VNV-MSC- 
Forest. 


TABLE 2

Comparison of cluster purity in mean entropy. Lower is
better.

Dataset              TISI                      ERCe
p(y|c)               traffic speed   weather   event
VO-Forest            0.8675          1.0676    0.0616
VNV-Kmeans           0.9197          1.4994    1.2519
VNV-Forest           0.8611          1.0889    0.0811
VNV-AASC             0.7217          0.7039    0.0691
VT-MSC-Forest        0.7275          0.9577    0.0580
VNV-MSC-Forest       0.7262          0.6071    0.0024
VPNV10-MSC-Forest    0.7190          0.6261    0.0024
VPNV20-MSC-Forest    0.7283          0.6497    0.0090


when we increase the non-visual data missing proportion,
overall the VNV-MSC-Forest model copes well with
partial/missing non-visual data. With no aid of non-visual tag
information, VT-MSC-Forest forms worse clusters, most
clearly on the weather and event tags. Nevertheless, the
superiority of VT-MSC-Forest over VO-Forest
suggests the effectiveness of temporal information with
MSC-Forest. The inferior performance of VO-Forest to
VNV-MSC-Forest suggests the importance of learning from
auxiliary non-visual sources. Not all methods perform
equally well when learning from the same visual and
non-visual sources: k-means and AASC yield less pure
clusters than MSC-Forest, most clearly on the weather and
event tags. The results suggest that
the proposed joint information gain criterion (Eqn. (4))
is more effective in handling heterogeneous data than the
conventional clustering models.

For qualitative comparison, we show examples in Fig. 6 
[Fig. 6 panels, labelled (X/Y) per method: VO-Forest (43/45); VNV-Kmeans (14/75); VNV-Forest (24/29); VNV-AASC (372/1324); VT-MSC-Forest (60/67); VNV-MSC-Forest (58/58); VPNV10-MSC-Forest (50/73); VPNV20-MSC-Forest (29/31).]


Fig. 6. Qualitative comparison on cluster quality on TISI. A key frame of each video is shown. (X/Y) in brackets: X = the number of clips 
with sunny weather; Y = the total number of clips in a cluster. The frames inside the red boxes are inconsistent clips in a cluster. 


using the TISI dataset for detecting ‘sunny’ weather. It
is evident that only VNV-MSC-Forest is able to provide
coherent video grouping, with only a slight decrease in
clustering purity given partial/missing non-visual data. Other
methods including VNV-AASC result in a large cluster,
either leaving out some relevant clips or including many
non-relevant ones, with most of them under the influence
of strong artificial lighting sources. These non-relevant clips
are visually ‘close’ to sunny weather, but semantically not.
The VNV-MSC-Forest model avoids this mistake by correlating
both visual and non-visual sources in an
information-theoretic sense.

5.2 Video Tagging 

Generating a video summary with semantic interpretations
requires accurate tag prediction. In this experiment we
compared the performance of different methods in inferring
semantic tags given previously-unseen clips extracted from
long videos. The proposed tagging algorithm (Section 3.2)
is used for VO-Forest, VT-MSC-Forest, VNV-MSC-Forest,
and VPNV10/20-MSC-Forest, whilst a nearest neighbour
(NN) strategy is used for the others. For quantitative evaluation,
we manually annotated 3 weather conditions (sunny, cloudy
and rainy) and 4 traffic speeds on TISI previously-unseen
clips, and 8 event categories on ERCe previously-unseen
clips.


TABLE 3

Comparison of tagging accuracy on the TISI dataset.

(%)                   traffic speed   weather
VO-Forest             27.62           50.65
VNV-Kmeans            37.80           43.14
VNV-Forest            34.95           43.81
VNV-AASC              36.13           44.37
VNV-MSC-Forest-hard   32.86           49.59
VT-MSC-Forest         35.99           54.47
VNV-MSC-Forest        35.77           61.05
VPNV10-MSC-Forest     37.99           55.99
VPNV20-MSC-Forest     38.05           54.97


Tagging video by weather and traffic conditions - The
experiment was conducted on the TISI outdoor dataset.
It is observed that the performance of different methods
(Table 3) is largely in line with their performance in
data clustering (Section 5.1). The poorest result of tagging
traffic conditions is yielded by VO-Forest. This suggests
the significance of exploiting non-visual data during model
training. It is also seen from Fig. 7 that VNV-MSC-Forest
not only outperforms other baselines in isolating
sunny weather, but also performs well in distinguishing the visually
ambiguous cloudy and rainy weather. In contrast, both
VNV-Kmeans and VNV-AASC mistake most of the ‘rainy’
scenes as either ‘sunny’ or ‘cloudy’, as they can be visually
similar.
[Figure: confusion-matrix panels for VO-Forest, VNV-Kmeans, VNV-Forest, VNV-AASC, VNV-MSC-Forest-hard, VNV-MSC-Forest, VT-MSC-Forest and VPNV20-MSC-Forest; rows and columns are Sunny, Cloudy and Rainy.]


Fig. 7. Weather tagging confusion matrices (TISI dataset). 


TABLE 4

Comparison of tagging accuracy on the ERCe dataset.

(%)             VO-Forest  VNV-Kmeans  VNV-Forest  VNV-AASC  VNV-MSC-Forest-hard  VT-MSC-Forest  VNV-MSC-Forest  VPNV10-MSC-Forest  VPNV20-MSC-Forest
No Schd. Event  79.48      87.91       32.47       48.51     81.25                57.43          55.98           47.96              55.57
Cleaning        39.50      19.33       30.25       45.80     41.60                70.17          41.28           46.64              46.22
Career Fair     94.41      59.38       65.46       79.77     70.07                91.45          100.0           100.0              100.0
Gun Forum       74.82      44.30       45.77       84.93     60.48                79.96          83.82           85.29              85.29
Group Studying  92.97      46.25       41.25       96.88     84.22                99.22          97.66           97.66              95.78
Schlr Comp.     82.74      16.71       33.15       89.40     82.88                90.08          99.46           99.73              99.59
Accom. Service  0.00       0.00        13.70       21.15     10.82                0.00           37.26           37.26              37.02
Stud. Orient.   60.94      9.77        33.59       38.87     47.85                43.75          88.09           92.38              88.09
Average         65.61      35.45       36.96       63.16     59.89                66.50          75.69           75.87              75.95


[Figure: confusion-matrix panels for VNV-MSC-Forest-hard, VNV-MSC-Forest, VT-MSC-Forest and VPNV20-MSC-Forest; rows and columns are No Schd. Event, Cleaning, Career Fair, Gun Forum, Group Studying, Schlr Comp., Accom. Service and Stud. Orient.]


Fig. 8. Event tagging confusion matrices (ERCe dataset). 


Tagging video by activity events - Tagging semantic
events was tested using the ERCe dataset. VO-Forest yields
poor results (Table 4 and Fig. 8), especially
on Accom. Service, which involves subtle activity patterns,
i.e. students visiting particular rooms, suggesting that using
visual data alone is not sufficient to detect such events.
VT-MSC-Forest over-fits to the ‘Cleaning’ event and therefore
performs poorly on the ‘Stud. Orient.’ event.

Due to the typically high dimension of visual sources compared
to non-visual data, the latter is often overwhelmed by
the former in representation. VNV-Kmeans suffers severely
from this problem, as most of its predictions are biased to No
Schd. Event, which is more common and frequent visually.
This suggests that this distance-based clustering is poor
in handling the heteroscedasticity and dimension discrepancy
problems in learning heterogeneous data. VNV-AASC
attempts to circumvent these problems by seeking an
optimal combination of affinity matrices derived independently
from distinct data sources. However, this proves
challenging, particularly when each source is inherently
noisy and inaccurate. In contrast, the proposed MSC-Forest
correlates different sources via a joint information gain
criterion to effectively alleviate these problems, leading
to more robust and accurate tagging performance. Again,
VPNV10/20-MSC-Forest perform comparably to
VNV-MSC-Forest, further validating the robustness of
MSC-Forest in tackling partial/missing non-visual data with the
proposed adaptive weighting mechanism (Section 2.2.1).
Interestingly, in some cases, VPNV10/20-MSC-Forest
models even outperform VNV-MSC-Forest slightly. A
possible cause is that the missing non-visual data happen
to be noisy, so that omitting them leads to better results.
Overall, the performance difference is marginal and the results
demonstrate that MSC-Forest provides stable tagging results
across both datasets.

α sensitivity - We analyse the relative significance of visual
data against non-visual and temporal data by varying its
weight $\alpha_v$ (Eqn. (4)) in MSC-Forest during model training.
The average tagging accuracy is used as the performance
measure. It is observed from Fig. 9 that setting
$\alpha_v = 0.5$ achieves satisfactory results for both datasets.
This observation suggests that visual and non-visual data
are almost equally informative. This setting of $\alpha_v$ is adopted
throughout our experiments.



[Figure: two panels, (a) TISI and (b) ERCe; horizontal axis: the weight of visual data.]

Fig. 9. The average tagging accuracy against varying
visual data weight $\alpha_v$ in Eqn. (4).

5.3 Semantic Video Summarisation 

In this experiment, we follow the method described in
Section 3, and show that the learned MSC-Forest model can be
easily extended to produce a compact yet meaningful video
summary of previously-unseen video footage, e.g. from
the Times Square Intersection scene, with automatically
generated semantic tags. Despite being captured from the same
scene as the TISI dataset, this previously-unseen video
is challenging in that it contains a number of events not
seen before (e.g. a scaffolding event), with very different
weather and traffic conditions. It is interesting to examine
how well the multi-source model could generalise for
drawing meaningful summarisation given such unexpected
disparities.

5.3.1 A Quantitative Evaluation on Summary Quality

Measuring the quality of a video summary quantitatively is
non-trivial since there is no formal definition in the literature.
In this study, we employ a coverage metric: an
ideal summary should cover as many events of interest
as possible$^9$. More precisely, given a video summary $V$,
its coverage is defined as

$$c = \frac{N_{covered}}{N_{all} \cdot \frac{|V|}{\max_i |V_i|}},$$

where $N_{covered}$ and $N_{all}$ represent the number of covered and
all events of interest, respectively. $|V|$ is the length
of the current summary, whilst $\max_i |V_i|$ represents the
maximum length of all comparative synopses. The term
$\frac{|V|}{\max_i |V_i|}$ thus penalises a summary with longer length.
Higher coverage is better, implying lower redundancy.

9. The event of interest is analogous to important objects/regions in [11].
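
As a worked check, the sketch below reads the coverage definition as $c = (N_{covered} / N_{all}) \cdot (\max_i |V_i| / |V|)$ and applies it to the summary lengths and covered-event counts reported in Table 5, with 12 ground-truth events and a maximum summary length of 29 clips. The function name and dictionary layout are illustrative.

```python
def coverage(n_covered: int, n_all: int, length: int, max_length: int) -> float:
    """Fraction of covered events, scaled up for summaries shorter than the longest one."""
    return (n_covered / n_all) * (max_length / length)

# (summary length in clips, number of covered events), as listed in Table 5.
summaries = {
    "Uniform-Sampling": (28, 3),
    "Sufficient-Change(L1)": (29, 2),
    "Sufficient-Change(L2)": (29, 4),
    "VO-Forest": (21, 3),
    "VNV-MSC-Forest": (28, 7),
}
n_all = 12                                                   # ground-truth events
max_length = max(length for length, _ in summaries.values())  # 29 clips

coverage_percent = {
    name: round(100 * coverage(n_cov, n_all, length, max_length), 1)
    for name, (length, n_cov) in summaries.items()
}
```

Under this reading the computed values agree with the coverage column of Table 5 to one decimal place, e.g. 60.4% for VNV-MSC-Forest and 34.5% for VO-Forest, whose shorter summary is rewarded despite covering only three events.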

In order to generate an unbiased ground truth of events of
interest, we asked 10 annotators to watch the previously-unseen
video carefully and label each video clip with
arbitrary event tags. Although these event tags were produced
independently in a somewhat subjective manner, the
repetition of similar tagging among different annotators is
high, e.g. most annotators labelled ‘unloading scaffolding
tubes’ and ‘policemen on-duty’ as events of interest.
Thus, we formed the ground truth with events that were
agreed by over 50% of the annotators. The final ground
truth consists of 12 events (Fig. 10).

Given the ground truth, we compared the quality of the
summary generated using the proposed multi-source
MSC-Forest with the baselines: (1) Uniform-Sampling: a
straightforward way of summarising video by uniformly sampling
video clips over time, assuming key events are distributed
evenly [32], [11]. (2) Sufficient-Change: a classical
summarisation strategy generic to video category [31], [56],
[32]. The idea is to select a clip that is significantly different
from the previous key clip, e.g. using a threshold-based
strategy, so that the extracted key clips may be diverse
and complete. The threshold can be estimated
based on the number of key clips. For the distance metric,
we adopt the L1-norm and L2-norm to measure pairwise
similarity between clips in our experiment. (3) VO-Forest:
the conventional forest [35] that exploits visual features
alone. For VO-Forest and MSC-Forest, we applied the
summarisation pipeline described in Section 3 for summary
composition. We generated the video summary by the
remaining methods via setting a duration similar to the
summary by MSC-Forest. Note that non-visual information
is not available during the summarisation stage. Hence,
for clustering based models, the quality of a summary
is essentially tied to the purity and coherency of the video clusters
discovered using different methods.
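
To make the Sufficient-Change baseline concrete, the following sketch keeps a clip whenever its feature distance to the most recent key clip exceeds a threshold. Always keeping the first clip and the specific threshold value are assumptions made here for illustration; the cited works may differ in such details.

```python
import numpy as np

def sufficient_change(features, threshold, order=2):
    """Select key clips whose distance to the previous key clip exceeds the threshold.

    features: (n_clips, n_dims) array in temporal order.
    order:    1 for the L1-norm, 2 for the L2-norm.
    """
    key_clips = [0]  # assumption: the first clip is always a key clip
    for i in range(1, len(features)):
        change = np.linalg.norm(features[i] - features[key_clips[-1]], ord=order)
        if change > threshold:
            key_clips.append(i)
    return key_clips

features = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0], [3.0, 0.0]])
keys_l2 = sufficient_change(features, threshold=1.5, order=2)
keys_l1 = sufficient_change(features, threshold=1.5, order=1)
```

With the same threshold, the L1-norm selects clip 2 whereas the L2-norm does not, which illustrates why the two variants can produce different summaries.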

The results are shown in Fig. 10 and Table 5. It is evident
that the MSC-Forest model achieves higher event coverage
than the baselines. This is in large part due to the MSC-Forest’s
ability for latent data structure discovery (Section 5.1). To
reveal concrete reasons for the difference in summarising
performance, for the same previously-unseen samples $\mathbf{x}^*$ with
an event of interest, e.g. parcel delivery, we compared the
assigned clusters: $c^*_{vnv}$ by our model and $c^*_{vo}$ by VO-Forest.
It is found that samples in $c^*_{vnv}$ are visually consistent
with each other and the majority share some similarity with $\mathbf{x}^*$,
e.g. someone standing at the edge of the pathway; whilst cluster
$c^*_{vo}$ is much larger, with no obvious visual commonality over
its cluster members. Uniform-Sampling performs poorly
since the assumption of uniform event distribution is often
invalid. Sufficient-Change is inferior to our model since
the visual data distance/similarity measure can be inaccurate
and less meaningful due to the challenging semantic
gap problem.
Fig. 10. The multi-source affinity matrix constructed by our model, along with key frames corresponding to ground truth events of interest:
(1) policemen on-duty, (2) blocking pathway, (3) workers unloading scaffolding tubes, (4)-(6) different stages of scaffolding, (7)(9)(10) van
parking aside, (8) parcel delivery, (11)(12) loitering events. The events covered by each method are indicated at the bottom-left
corner of the key frame, with method IDs defined as: (a) Uniform-Sampling; (b) Sufficient-Change(L1); (c) Sufficient-Change(L2); (d) VO-Forest; (e)
VNV-MSC-Forest.


TABLE 5

Quantitative comparison of summary. Length = clip
number.

Method                  Length   Event number   Coverage
Uniform-Sampling        28       3              25.9%
Sufficient-Change(L1)   29       2              16.7%
Sufficient-Change(L2)   29       4              33.3%
VO-Forest               21       3              34.5%
VNV-MSC-Forest(Ours)    28       7              60.4%


5.3.2 A User Study on Summary Quality 

We conducted a user study to examine whether the non-visual tags
inferred using the MSC-Forest model could complement
the unilateral perspective offered by a purely visual summary.
We showed two video summaries to 10 volunteers:
(i) a purely visual summary, and (ii) the same summary but
enriched with semantic tags inferred using the proposed
multi-source model$^{10}$. The tagged summary is shown in
Fig. 11. Each volunteer was asked to compare and rate
the two summaries based on their preference. It is worth
pointing out that passing the user test is challenging because
adding non-visual tags to a summary is not
necessarily better than adding none. Tags that correlate poorly with
visual context could even jeopardise user experience.

10. The inferred non-visual tags include weather, traffic conditions, and
typicality. The typicality tag of each clip, i.e. usual or interesting, is
computed based on the size of its assigned cluster (Fig. 4(c)). Clips
assigned to the top 20% smallest clusters are treated as ‘interesting’.


It is evident from Fig. 12 that the visual summary augmented
with non-visual tags was preferred by all participants
over the conventional visual-only summary. A follow-up
survey with the volunteers reveals several interesting
reasons for their selection. Many volunteers found that the
inferred non-visual tags were valuable in providing auxiliary
context to achieve better global situational awareness.
In particular, the tags helped them to ‘connect the dots’
and make sense of the previously-unseen (and likely
unfamiliar) video footage. Some other volunteers credited
the additional non-visual tags with focusing their attention on
particular events, and helping them to spot ‘outliers’ of
interest.

This user study provides an independent means to analyse
and validate the usefulness of visual summarisation
with auto-tag inference on previously-unseen video footage
without a priori semantics or meta-data, which is typical
of surveillance videos. It also shows the effectiveness of
the proposed model in mapping multi-source non-visual
information to unstructured and previously-unseen video
data for automatic tagging and summarisation of the videos.

5.4 Multi-Source Model Visualisation 

The superior performance of VNV-MSC-Forest can be
better explained by examining more closely the capacity
of MSC-Forest in uncovering and exploiting the intrinsic
correlation among different visual sources and, more critically,
among visual and non-visual sources. This indirect
correlation among heterogeneous sources results in well-structured
decision trees, subsequently leading to more consistent
data clusters and more accurate semantics inference.
The details of computing the multi-source correlation are
presented in Appendix A. Here we show an example of the
multi-source correlation revealed by our MSC-Forest for model
visualisation purposes.

Fig. 11. A storyboard version of our video summary enriched with non-visual tags. (Tags recovered from the key frames include: Typicality: usual; Weather: cloudy or sunny; Traffic: medium or slow; Time: 05:00, 10:24, 13:38, 15:21.)

Fig. 12. User study: tagged versus pure-visual summary.

Intuitively, vehicle and person counts should correlate in
a busy scene like TISI. Our MSC-Forest discovered this
correlation (see Fig. 13(a)), so the less reliable vehicle
detection at a distance against a cluttered background
can enjoy latent support from the more reliable person
detection in regions 5-16, which are close to the camera.

Moreover, visual sources also benefit from correlated
support from non-visual data through our cross-source
information gain optimisation (Eqn. (4)). An example is
the intuitive correlation between traffic speed and visual
appearance, e.g. slow traffic speed often corresponds to
crowded scenarios with a large number of pedestrians and
vehicles, whilst fast traffic speed corresponds to sparse people and cars.
Such cross-source correlation can be captured by our MSC-Forest,
as observed in Fig. 13(b): the vehicle detection
responses over the road area present a stronger interaction with
traffic speed data than those on the walk path, where vehicles
should not appear. In other words, vehicle detection features
of the road area are preferred over those on the walk path in node
splitting due to the larger induced joint information gain (Eqn.
(4)), which is clearly desired. This discovered correlation is
further exploited by MSC-Forest during the node splitting
optimisation process and thus facilitates the separation of
different crowdedness levels of visual data. This leads to
better clusters and eventually benefits video summarisation.

Fig. 13. The discovered multi-source correlation by our
MSC-Forest on TISI. (a) Visual-visual. (b) Vehicle detection and traffic speed.

5.5 Computational Costs and Model Complexity 

We examined the computational cost of training the
proposed MSC-Forest in comparison to conventional
forests. Time was measured on a Windows PC with
a dual-core CPU @ 2.66 GHz and 4.0 GB RAM, using a C++
implementation. Only one core was utilised for training each
forest. We recorded the model training time under the same
experimental setting as stated in Section 4. It is observed
from Table 6 that the training cost of an MSC-Forest model
is significantly lower than that of learning conventional
forests. In particular, VNV-MSC-Forest reduces the
training time by 14.4% and 17.1% on TISI, and by 64.1%
and 64.4% on ERCe, when compared with VO-Forest and
VNV-Forest, respectively. We observed a similar trend in the
model inference time.
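
The quoted reductions follow directly from the training times listed in Table 6. The short check below recomputes them; the dictionary layout and the helper name `relative_reduction` are illustrative choices, while the timings themselves are copied from the table.

```python
# Training times in seconds, copied from Table 6.
train_time = {
    "TISI": {"VO-Forest": 10306, "VNV-Forest": 10646, "VNV-MSC-Forest": 8823},
    "ERCe": {"VO-Forest": 21831, "VNV-Forest": 22015, "VNV-MSC-Forest": 7845},
}


def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# Reduction of VNV-MSC-Forest against each conventional forest, per dataset.
reductions = {
    (dataset, baseline): round(
        relative_reduction(times[baseline], times["VNV-MSC-Forest"]), 1
    )
    for dataset, times in train_time.items()
    for baseline in ("VO-Forest", "VNV-Forest")
}

for key, value in reductions.items():
    print(key, f"{value}%")
```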












































The lower computational cost of MSC-Forest is due to
its shallow and balanced trees, which result from the additional
non-visual and temporal information used during tree optimisation.
To make this concrete, we report in Table 6 the averaged
tree fan-in $\phi^* = \frac{1}{T_{clust}} \sum_{t=1}^{T_{clust}} \phi(t)$ of different forest models.
A forest with shallow and balanced trees tends to have
a small $\phi^*$ (see Section 2.2.2 for a discussion on tree fan-in).
In addition, we profiled the length of the path (from
root to leaf node) traversed by training samples. A shallow
and balanced tree tends to have shorter path lengths. The
distributions depicted in Fig. 14 suggest that MSC-Forest
has a shallower and more balanced tree topology than
conventional forests. It is worth pointing out that,
despite the shallower structure, MSC-Forest outperforms
other models in our clustering and tagging experiments.
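
To illustrate why a balanced tree yields shorter root-to-leaf paths, the toy sketch below profiles path lengths for two hand-built binary trees holding the same number of training samples. The nested-tuple tree encoding and the function name `path_lengths` are assumptions made for the example; it does not compute the fan-in measure of Section 2.2.2.

```python
def path_lengths(tree, depth=0):
    """Return one root-to-leaf path length per training sample.

    A leaf is an int giving the number of samples that reach it;
    an internal node is a (left, right) tuple.
    """
    if isinstance(tree, int):
        return [depth] * tree
    left, right = tree
    return path_lengths(left, depth + 1) + path_lengths(right, depth + 1)


def mean(values):
    return sum(values) / len(values)


# Both trees route 100 samples to four leaves of 25 samples each.
balanced = ((25, 25), (25, 25))
skewed = (25, (25, (25, 25)))

mean_balanced = mean(path_lengths(balanced))  # every sample travels 2 edges
mean_skewed = mean(path_lengths(skewed))      # paths of length 1, 2, 3 and 3
```

The skewed tree has a larger mean path length (2.25 against 2.0) although both trees have the same number of leaves, which is the effect the path-length profiling is meant to expose.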

TABLE 6

Random forest model training complexity. Lower is
better. TT = training time (in seconds).

                    TISI                  ERCe
Method              TT       $\phi^*$     TT       $\phi^*$
VO-Forest           10306    109392       21831    359247
VNV-Forest          10646    108865       22015    359364
VNV-MSC-Forest      8823     91316        7845     137620


Fig. 14. Comparing tree path length statistics (x-axis: path length; recovered panel title: TISI). The
same legend is used for both charts.


6 Conclusion and Future Work 

We have presented a novel unsupervised multi-source
learning model for video summarisation. Specifically, we
introduced a joint information gain function for discovering
and exploiting latent correlations among independent heterogeneous
data sources. The function naturally copes with
diverse types of data with different representations, distributions,
and dimensions. Importantly, our model is capable
of tolerating partial and missing non-visual data, making
it well suited to automatic semantic tag inference on previously-unseen
video footage and to video summarisation. Furthermore,
the proposed joint optimisation encourages more
compact decision trees, leading to more efficient model
training and semantic tag inference. Comparative experiments
have demonstrated the advantages of the proposed
multi-source video clustering model over existing visual-only
models, for both discovering latent video clusters and
inferring non-visual semantic tags on previously-unseen
video footage. A comprehensive user study was carried
out to validate independently the effectiveness of deploying
the proposed model for generating contextually-rich and
semantically-meaningful video summaries.

The proposed model is not limited to surveillance-type
videos but can be generalised to other types of unstructured
and un-tagged consumer videos or egocentric videos, if
3D camera motion-invariant features or egocentric features
[11] are adopted. For future work, we will consider
generalising or transferring a learned model to new scenes that
are significantly different from the training environments.
This can be partly addressed by utilising intermediate data
representations such as attributes.


Appendix A

Quantifying Correlation between Sources

Quantifying the latent correlation between different sources
gives insights into their interactions in forming coherent
video groupings. This can be done once an MSC-Forest is
trained. To quantify between-source correlation, we first
estimate the correlation among their constituent features.

Visual-visual feature correlation - Visual-visual feature
correlation is typically quantified by the similarity of the
split-node partitions $L$ and $R$ that the features induce [35]. In particular,
consider a split node $s$ and its final optimal split, say $L_\nu$ and
$R_\nu$, by feature $\nu$. From Eqn. (2), we recall that this feature
$\nu$ is selected from the $m_{try}$ randomly sampled features
$F_s = \{f_1, \dots, f_{m_{try}}\}$. Let $\tau \in F_s \setminus \{\nu\}$ and its optimal left-right
partitions be $L_\tau$ and $R_\tau$ respectively. The node-level
correlation between features $\nu$ and $\tau$ is then defined as

$$\lambda_s(\nu, \tau) = \frac{p_\nu - \left(1 - \frac{|L_\nu \cap L_\tau|}{|L_\nu \cup R_\nu|} - \frac{|R_\nu \cap R_\tau|}{|L_\nu \cup R_\nu|}\right)}{p_\nu} \tag{16}$$

where $p_\nu = \min(|L_\nu|, |R_\nu|) / |L_\nu \cup R_\nu|$, thus $p_\nu \in (0, \frac{1}{2}]$.

With Eqn. (16) we assign a strong correlation ($\lambda_s(\nu, \tau) = 1$)
to a feature pair $(\nu, \tau)$ if they produce the same data
partition, and a weak correlation ($\lambda_s(\nu, \tau) \le -1$) when
their partitions have no overlap. For simplicity we let
$\lambda_s(\nu, \tau) = \max(\lambda_s(\nu, \tau), 0)$ such that $\lambda_s(\nu, \tau)$ lies in the
range $[0, 1]$. The final visual-visual feature correlation
$\lambda(\nu, \tau)$ is obtained via

$$\lambda(\nu, \tau) = \frac{1}{T_{clust}} \sum_{t=1}^{T_{clust}} \frac{1}{N^t_{\nu,\tau}} \sum_{k} \lambda_{s_k}(\nu, \tau) \tag{17}$$

where $N^t_{\nu,\tau}$ refers to the number of sampling co-occurrences
of a feature pair $(\nu, \tau)$ during the splitting
process of an MSC-tree $t$, and $k$ indexes these co-occurrences.
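
The node-level measure and its forest-level aggregation can be sketched in a few lines. This is a hedged illustration rather than the authors' code. It assumes the node-level score has the surrogate-split form $(p_\nu - (1 - a)) / p_\nu$, where $a$ is the fraction of samples on which the two partitions agree and $p_\nu$ is the fraction of samples in the smaller child of the optimal split. It also assumes both children are non-empty, and that a tree in which the pair never co-occurs contributes zero. The function names are hypothetical.

```python
def node_correlation(left_v, right_v, left_t, right_t):
    """Clipped node-level correlation between the optimal split (v) and a
    competing split (t), each given as a pair of sets of sample indices."""
    n = len(left_v | right_v)
    p_v = min(len(left_v), len(right_v)) / n  # assumes both children non-empty
    agreement = (len(left_v & left_t) + len(right_v & right_t)) / n
    score = (p_v - (1.0 - agreement)) / p_v
    return max(score, 0.0)  # clip negative values so the score lies in [0, 1]


def forest_correlation(per_tree_scores):
    """Average node-level scores within each tree, then across all trees.

    per_tree_scores: one list of node-level scores per tree. An empty list
    means the pair never co-occurred in that tree; such trees contribute 0
    (an assumption made for this sketch).
    """
    total = 0.0
    for scores in per_tree_scores:
        if scores:
            total += sum(scores) / len(scores)
    return total / len(per_tree_scores)


left, right = {0, 1}, {2, 3, 4}
same = node_correlation(left, right, left, right)        # identical partitions
flipped = node_correlation(left, right, right, left)     # no overlap, clipped
partial = node_correlation(left, right, {0, 1, 2}, {3, 4})
forest = forest_correlation([[1.0, 0.5], [0.0], []])
```

Identical partitions give a score of 1, fully disagreeing partitions are clipped to 0, and the partially agreeing split above scores about 0.5.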
Visual-nonvisual feature correlation - Recall that visual
and non-visual data play different roles in our MSC-Forest,
i.e. the former as splitting features and the latter as auxiliary
information. This difference makes the above equations
inapplicable to the computation of visual-nonvisual
feature correlation, since no data split is associated with
non-visual features. Instead, we adopt information gain as
the visual-nonvisual feature correlation metric. This metric
is appropriate in that it also reflects the intrinsic mutual
interaction between visual and non-visual features during
joint information gain optimisation (Eqn. (4)). Formally,
we quantify the node-level correlation $\lambda_s(\nu, \omega)$ between the optimal
splitting visual feature $\nu$ and a non-visual feature $\omega$ as
the non-visual term of Eqn. (4). The final
visual-nonvisual feature correlation $\lambda(\nu, \omega)$ is computed
similarly by Eqn. (17).

Correlation between sources - Given the between-feature
correlation, the final correlation between any two sources
$A$ and $B$ can then be estimated through

$$\lambda(A, B) = \frac{1}{|A||B|} \sum_{\nu \in A} \sum_{\tau \in B} \lambda(\nu, \tau). \tag{18}$$
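
As a rough sketch of the source-level step, the snippet below averages feature-level correlations over all cross-source feature pairs. The normalisation by the number of pairs is an assumption of this example, as are the function name `source_correlation`, the feature names, and the made-up correlation values. Pairs with no recorded correlation are treated as zero.

```python
def source_correlation(feature_corr, source_a, source_b):
    """Mean feature-level correlation over all pairs (v, t) with v drawn
    from source_a and t from source_b.

    feature_corr: dict mapping (v, t) feature-name pairs to a correlation
    in [0, 1]; missing pairs default to 0 (an assumption of this sketch).
    """
    pairs = [(v, t) for v in source_a for t in source_b]
    return sum(feature_corr.get(pair, 0.0) for pair in pairs) / len(pairs)


# Hypothetical features: two person-detection and two vehicle-detection regions.
person = ["person_r1", "person_r2"]
vehicle = ["vehicle_r1", "vehicle_r2"]
feature_corr = {
    ("person_r1", "vehicle_r1"): 0.8,
    ("person_r1", "vehicle_r2"): 0.4,
    ("person_r2", "vehicle_r1"): 0.6,
    ("person_r2", "vehicle_r2"): 0.2,
}

score = source_correlation(feature_corr, person, vehicle)
```

For the four illustrative values above the source-level score is their mean, about 0.5.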